Ranking coherence in topic models using statistically validated networks

Abstract

Probabilistic topic models have become one of the most widespread machine learning techniques in textual analysis. Topic discovery is an unsupervised process that does not guarantee the interpretability of its output. Hence, the automatic evaluation of topic coherence has attracted the interest of many researchers over the last decade, and it remains an open research area. This article offers a new quality evaluation method based on statistically validated networks (SVNs). The proposed probabilistic approach consists of representing each topic as a weighted network of its most probable words. The presence of a link between each pair of words is assessed by statistically validating their co-occurrence in sentences against the null hypothesis of random co-occurrence. The proposed method makes it possible to distinguish between high-quality and low-quality topics by making use of a battery of statistical tests. The statistically significant pairwise associations of words represented by the links in the SVN might reasonably be expected to be strictly related to the semantic coherence and interpretability of a topic. Therefore, the more connected the network, the more coherent the topic in question. We demonstrate the effectiveness of the method through an analysis of a real text corpus, which shows that the proposed measure is more correlated with human judgement than the state-of-the-art coherence measures.
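
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which tests are used, so the sketch assumes a one-sided hypergeometric test for sentence-level co-occurrence, a Bonferroni correction across word pairs, and the share of validated pairs as the coherence score. The names `hypergeom_sf`, `svn_coherence` and `alpha` are hypothetical, and link weights are reduced to the stored p-values.

```python
from itertools import combinations
from math import comb


def hypergeom_sf(k, n_sent, n_a, n_b):
    """P(X >= k): chance that two words occurring in n_a and n_b of
    n_sent sentences share at least k sentences purely at random."""
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, x) * comb(n_sent - n_a, n_b - x) for x in range(k, upper + 1)
    )
    return favourable / comb(n_sent, n_b)


def svn_coherence(topic_words, sentences, alpha=0.01):
    """Return (validated links, share of word pairs that are validated).

    topic_words: the most probable words of one topic.
    sentences:   iterable of tokenised sentences (lists of words).
    """
    sents = [set(s) for s in sentences]
    n_sent = len(sents)
    # Index of the sentences in which each topic word occurs.
    occ = {w: {i for i, s in enumerate(sents) if w in s} for w in topic_words}

    pairs = list(combinations(topic_words, 2))
    if not pairs:
        return {}, 0.0
    # Assumption: Bonferroni correction over all tested pairs.
    threshold = alpha / len(pairs)

    links = {}
    for a, b in pairs:
        k = len(occ[a] & occ[b])
        if k == 0:
            continue  # no co-occurrence, so no link can be validated
        p_value = hypergeom_sf(k, n_sent, len(occ[a]), len(occ[b]))
        if p_value < threshold:
            links[(a, b)] = p_value
    # More validated links means a more connected network, read here as
    # a more coherent topic.
    return links, len(links) / len(pairs)
```

Calling `svn_coherence(top_words, tokenised_sentences)` for each topic and sorting by the second return value gives one possible ranking of topics; a less conservative correction (for example a false discovery rate procedure) could be swapped in for the Bonferroni threshold.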

Similar resources

Statistically Validated Networks in Bipartite Complex Systems

Many complex systems present an intrinsic bipartite structure where elements of one set link to elements of the second set. In these complex systems, such as the system of actors and movies, elements of one set are qualitatively different than elements of the other set. The properties of these complex systems are typically investigated by constructing and analyzing a projected network on one of...

Improving Topic Coherence with Regularized Topic Models

Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g. web search result snippets or blog posts...

Fault Location in Power Distribution Networks Using Matching Algorithm

Thesis/dissertation abstract: To date, numerous methods have been proposed for fault location in transmission networks. Direct use of these methods in distribution networks, for reasons such as the presence of multiple branches, non-uniformity of feeders (cable lines, overhead lines, differing cross-sections of the branches and the main feeder trunk), imbalance (untransposed lines, single-phase and three-phase loads), non-constant load, and measurement of voltage and current values only at the beginning of...

Optimizing Semantic Coherence in Topic Models

Large organizations often face the critical challenge of sharing information and maintaining connections between disparate subunits. Tools for automated analysis of document collections, such as topic models, can provide an important means for communication. The value of topic modeling is in its ability to discover interpretable, coherent themes from unstructured document sets, yet it is not un...

Optimizing Semantic Coherence in Topic Models

Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subspaces (topics) that are obviously flawed to hum...

Journal

Journal title: Journal of Information Science

Year: 2023

ISSN: 0165-5515, 1741-6485

DOI: https://doi.org/10.1177/01655515221148369